To start this project, I have to load in the packages and data that I am going to use. I will explain what I am doing and what the variables mean below.
library(ggplot2)
library(caret)
## Loading required package: lattice
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.84 loaded
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
Next I read in the data. HR1 and HR2 (home runs from the two previous seasons) load as characters because of the NAs. I’ll get to the NAs later, but for now the columns need to be integers, so I convert them, count the total NAs we are dealing with, and double-check that everything is ready for data analysis. I built this data set by exporting a CSV file from FanGraphs after sorting columns and excluding players with fewer than 200 plate appearances, then performing a VLOOKUP for past years’ home run totals in a separate notebook and combining everything into one file. The data prep took roughly an hour but was pretty painless.
library(readr)
HomeRuns <- read_csv('data/HomeRunsPredictorDataRevised.csv')
## Parsed with column specification:
## cols(
## .default = col_double(),
## Name = col_character(),
## HR = col_integer(),
## Age = col_integer(),
## PA = col_integer(),
## Doubles = col_integer(),
## HR1 = col_character(),
## HR2 = col_character()
## )
## See spec(...) for full column specifications.
HomeRuns$HR1 <- as.integer(HomeRuns$HR1)
## Warning: NAs introduced by coercion
HomeRuns$HR2 <- as.integer(HomeRuns$HR2)
## Warning: NAs introduced by coercion
sum(is.na(HomeRuns$HR1))
## [1] 78
sum(is.na(HomeRuns$HR2))
## [1] 130
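To make the coercion step concrete, here is a minimal sketch on a toy vector. The "#N/A" strings are my assumption about what a failed spreadsheet lookup might leave behind; the real file may use a different placeholder.

```r
# Toy stand-in for a spreadsheet column; "#N/A" is an assumed placeholder
# for a failed lookup, not something confirmed by the original file.
hr1_raw <- c("33", "#N/A", "12", "#N/A", "0")

# as.integer() turns every non-numeric string into NA (with a warning,
# suppressed here because the NAs are expected).
hr1 <- suppressWarnings(as.integer(hr1_raw))

# Count the missing values, the same check run on HR1 and HR2 above.
n_missing <- sum(is.na(hr1))

print(hr1)
print(n_missing)
```

Any string that is not a whole number becomes NA, which is why the real columns pick up a coercion warning without losing the valid totals.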
str(HomeRuns)
## Classes 'tbl_df', 'tbl' and 'data.frame': 355 obs. of 38 variables:
## $ Name : chr "Max Muncy" "Juan Soto" "Shohei Ohtani" "Ronald Acuna Jr." ...
## $ HR : int 35 22 22 26 3 10 7 27 16 24 ...
## $ Age : int 27 19 23 20 26 27 24 23 22 21 ...
## $ PA : int 481 494 367 487 248 221 334 606 285 484 ...
## $ Doubles : int 17 25 21 26 11 14 16 47 9 16 ...
## $ BBPct : num 0.164 0.16 0.101 0.092 0.056 0.118 0.147 0.041 0.084 0.087 ...
## $ KPct : num 0.272 0.2 0.278 0.253 0.097 0.249 0.138 0.16 0.281 0.252 ...
## $ BB_K : num 0.6 0.8 0.36 0.37 0.58 0.47 1.07 0.26 0.3 0.34 ...
## $ OBP : num 0.391 0.406 0.361 0.366 0.381 0.357 0.405 0.328 0.34 0.34 ...
## $ BABIP : num 0.299 0.338 0.35 0.352 0.359 0.315 0.336 0.316 0.345 0.321 ...
## $ GB_FB : num 0.76 1.87 1.32 1.07 0.97 1.23 1.24 1.23 1.65 0.77 ...
## $ LDPct : num 0.208 0.175 0.236 0.183 0.216 0.213 0.24 0.202 0.21 0.245 ...
## $ GBPct : num 0.343 0.537 0.436 0.423 0.387 0.434 0.421 0.44 0.492 0.328 ...
## $ FBPct : num 0.449 0.288 0.329 0.394 0.397 0.353 0.339 0.358 0.298 0.427 ...
## $ HR_FB : num 0.294 0.247 0.297 0.211 0.038 0.208 0.089 0.157 0.296 0.179 ...
## $ wFB : num 27.1 32.1 18.5 17.3 0.7 7.5 10.4 12.1 6.9 13.1 ...
## $ wSL : num -1.8 -3 0.1 0.3 3.1 0 1.3 8 3.3 -2.7 ...
## $ wCT : num 3.1 -0.3 -1.2 1.9 0.9 0.6 -0.3 2.2 -0.5 -1.3 ...
## $ wCB : num -0.3 1 2.7 4.2 2.3 0.9 1.9 5.9 1.5 2.1 ...
## $ wCH : num 4.2 0 2.5 7 4.3 0 0.7 1 -0.8 1 ...
## $ wSF : num 4.1 -0.9 0.5 -0.2 1 0.7 0.9 -1.4 NA -0.5 ...
## $ OSwingPct : num 0.215 0.219 0.323 0.275 0.353 0.234 0.222 0.394 0.317 0.344 ...
## $ ZSwingPct : num 0.578 0.607 0.653 0.728 0.842 0.653 0.653 0.74 0.691 0.687 ...
## $ SwingPct : num 0.37 0.388 0.457 0.461 0.56 0.409 0.408 0.531 0.464 0.484 ...
## $ OContactPct: num 0.58 0.681 0.591 0.598 0.749 0.512 0.704 0.697 0.543 0.559 ...
## $ ZContactPct: num 0.804 0.857 0.806 0.827 0.909 0.832 0.924 0.918 0.81 0.818 ...
## $ ContactPct : num 0.729 0.801 0.716 0.746 0.851 0.725 0.856 0.819 0.699 0.709 ...
## $ ZonePct : num 0.426 0.436 0.406 0.409 0.422 0.418 0.431 0.396 0.393 0.409 ...
## $ FStrikePct : num 0.559 0.575 0.583 0.62 0.645 0.566 0.542 0.663 0.632 0.622 ...
## $ SwStrPct : num 0.1 0.077 0.13 0.117 0.084 0.112 0.059 0.096 0.14 0.141 ...
## $ PullPct : num 0.447 0.361 0.369 0.438 0.356 0.416 0.371 0.475 0.337 0.422 ...
## $ CentPct : num 0.312 0.364 0.373 0.361 0.351 0.307 0.371 0.292 0.431 0.327 ...
## $ OppoPct : num 0.241 0.275 0.258 0.201 0.293 0.277 0.257 0.233 0.232 0.251 ...
## $ SoftPct : num 0.124 0.203 0.102 0.137 0.22 0.139 0.118 0.194 0.149 0.14 ...
## $ MedPct : num 0.402 0.449 0.467 0.419 0.478 0.423 0.443 0.446 0.409 0.476 ...
## $ HardPct : num 0.474 0.348 0.431 0.444 0.302 0.438 0.439 0.36 0.442 0.384 ...
## $ HR1 : int NA NA NA NA NA NA NA NA NA NA ...
## $ HR2 : int NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 38
## .. ..$ Name : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ HR : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Age : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ PA : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Doubles : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ BBPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ KPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ BB_K : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ OBP : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ BABIP : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ GB_FB : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ LDPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ GBPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ FBPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ HR_FB : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ wFB : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ wSL : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ wCT : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ wCB : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ wCH : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ wSF : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ OSwingPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ ZSwingPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ SwingPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ OContactPct: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ ZContactPct: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ ContactPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ ZonePct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ FStrikePct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ SwStrPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ PullPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ CentPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ OppoPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ SoftPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ MedPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ HardPct : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ HR1 : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ HR2 : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
I am attempting to create a regression model to accurately predict individual players’ home run totals. I have created a spreadsheet using data from FanGraphs.com, and I will look at each variable to see which ones I might be more inclined to use in finding the right model. If I ended up using all of them, I would have a very overfit model, and the best models are often the simplest. I excluded any stats that rely heavily on home run data in their calculation; for example, there is no slugging percentage because home runs are a big part of that equation. I kept things like on base percentage, though, because a home run is weighted the same as a single in that calculation, so it’s more a measure of hitting ability than of power. I also included home run to fly ball ratio because FanGraphs tracks it and it can be above or below league average, but I will likely refrain from using it in the model because it directly uses home runs in the calculation. It will still be helpful to see some statistics about it. I will do that in this document and do the regression analysis in the next document. I will give a short explanation of each of the variables below. Here are the initial inputs.
Name: The player’s name in the data set, used here to confirm that all the data matches up
HR: The dependent variable in this study, the home runs the player hit
Age: The player’s age
PA: Plate appearances, the number of times a player bats, including walks, hit by pitches, sacrifices and so on
Doubles: The number of doubles a player hits
BBPct: The number of walks a player has divided by plate appearances
KPct: The number of strikeouts a player has divided by plate appearances
BB_K: A hitter’s walk to strikeout ratio
OBP: The times a player reaches base via hit, walk or HBP divided by plate appearances (strictly, at-bats plus walks, HBP and sacrifice flies)
BABIP: A player’s batting average on balls they put into play (excludes strikeouts and home runs)
GB_FB: A hitter’s ground ball to fly ball ratio
LDPct: Line Drive Percentage is the percentage of a player’s batted balls that are classified as line drives
GBPct: Ground Ball Percentage is the percentage of a player’s batted balls that are classified as ground balls
FBPct: Fly Ball Percentage is the percentage of a player’s batted balls that are classified as fly balls
HR_FB: The percentage of a player’s fly balls that are home runs
wFB/wSL/wCT/wCB/wCH/wSF: Linear weights of run expectancy for each pitch type: Fastball, Slider, Cutter, Curveball, Changeup and Splitter. For example, if a player hits a double on a 2-0 count off a fastball, their wFB is credited with the linear weight for that event; if they strike out on a fastball in their next at-bat, the corresponding linear weight is subtracted from the same stat
OSwingPct: Swings at pitches outside the strike zone divided by the pitches the player saw outside the strike zone
ZSwingPct: Swings at pitches inside the strike zone divided by the pitches the player saw inside the strike zone
SwingPct: Swings divided by pitches to a player
OContactPct: Swings on which a player made contact with pitches outside the strike zone divided by the player’s swings at pitches outside the zone
ZContactPct: Swings on which a player made contact with pitches inside the strike zone divided by the player’s swings at pitches inside the zone
ContactPct: Number of pitches on which contact was made divided by swings for a player
ZonePct: Pitches a player saw inside the strike zone divided by the total pitches they saw
FStrikePct: Percentage of a player’s plate appearances that start with a first-pitch strike
SwStrPct: Swings and misses by a player divided by total pitches seen
Pull/Cent/OppoPct: The percentage of a player’s batted balls that went to the pull field, center field and opposite field. For a lefty the pull field is right field and for a righty it is left field; center is the center field area for both types of hitters; the opposite field is the opposite of the pull field
Soft/Med/HardPct: Percentage of a player’s batted balls classified by how hard they were hit; the three sum to 100%
HR1: Home runs from 2017 (previous season)
HR2: Home runs from 2016 (two seasons prior)
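A quick worked example may help tie these definitions together. The counting stats below are made-up inputs for a single hitter, not values taken from the data set, and the OBP and SLG formulas are the standard ones rather than anything specific to this project. BB_K is computed as walks over strikeouts, which matches the data (0.164 / 0.272 is roughly 0.60 for the first row).

```r
# Illustrative counting stats for one hitter (made-up inputs, not taken
# from the data set).
pa <- 481; ab <- 395; bb <- 79; k <- 131; hbp <- 5; sf <- 2
singles <- 50; doubles <- 17; triples <- 2; hr <- 35
hits <- singles + doubles + triples + hr

# Plate-discipline rates as defined in the variable list
bb_pct <- bb / pa
k_pct  <- k / pa
bb_k   <- bb / k

# OBP counts a home run once, like any other hit
obp <- (hits + bb + hbp) / (ab + bb + hbp + sf)

# SLG counts a home run as four total bases, so it leans heavily on HR
total_bases <- singles + 2 * doubles + 3 * triples + 4 * hr
slg <- total_bases / ab
hr_share_of_tb <- (4 * hr) / total_bases

print(round(c(BBPct = bb_pct, KPct = k_pct, BB_K = bb_k,
              OBP = obp, SLG = slg), 3))
print(round(hr_share_of_tb, 3))
```

With these inputs, home runs supply well over half of the total bases behind SLG, while in OBP each homer counts no more than a single. That is the reason for keeping OBP and leaving slugging out.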
summary(HomeRuns)
## Name HR Age PA
## Length:355 Min. : 1.00 Min. :19.00 Min. :200.0
## Class :character 1st Qu.: 7.00 1st Qu.:25.00 1st Qu.:322.0
## Mode :character Median :13.00 Median :28.00 Median :456.0
## Mean :14.45 Mean :28.19 Mean :449.4
## 3rd Qu.:20.00 3rd Qu.:31.00 3rd Qu.:580.5
## Max. :48.00 Max. :39.00 Max. :745.0
##
## Doubles BBPct KPct BB_K
## Min. : 3.00 Min. :0.01500 Min. :0.0730 Min. :0.1000
## 1st Qu.:13.00 1st Qu.:0.06400 1st Qu.:0.1760 1st Qu.:0.2900
## Median :19.00 Median :0.08400 Median :0.2160 Median :0.4000
## Mean :20.85 Mean :0.08624 Mean :0.2174 Mean :0.4263
## 3rd Qu.:27.00 3rd Qu.:0.10500 3rd Qu.:0.2535 3rd Qu.:0.5250
## Max. :51.00 Max. :0.20100 Max. :0.3850 Max. :1.3300
##
## OBP BABIP GB_FB LDPct
## Min. :0.2320 Min. :0.1890 Min. :0.510 Min. :0.1400
## 1st Qu.:0.2985 1st Qu.:0.2725 1st Qu.:0.970 1st Qu.:0.1915
## Median :0.3220 Median :0.2980 Median :1.180 Median :0.2130
## Mean :0.3224 Mean :0.2972 Mean :1.269 Mean :0.2149
## 3rd Qu.:0.3450 3rd Qu.:0.3200 3rd Qu.:1.495 3rd Qu.:0.2370
## Max. :0.4600 Max. :0.4060 Max. :3.900 Max. :0.3230
##
## GBPct FBPct HR_FB wFB
## Min. :0.2400 Min. :0.1620 Min. :0.0120 Min. :-16.500
## 1st Qu.:0.3800 1st Qu.:0.3110 1st Qu.:0.0840 1st Qu.: -3.300
## Median :0.4240 Median :0.3620 Median :0.1230 Median : 2.400
## Mean :0.4264 Mean :0.3587 Mean :0.1282 Mean : 3.579
## 3rd Qu.:0.4745 3rd Qu.:0.4010 3rd Qu.:0.1690 3rd Qu.: 8.800
## Max. :0.6310 Max. :0.5170 Max. :0.3500 Max. : 40.100
##
## wSL wCT wCB wCH
## Min. :-12.4000 Min. :-6.4000 Min. :-7.6000 Min. :-13.6000
## 1st Qu.: -3.1000 1st Qu.:-1.4000 1st Qu.:-1.5000 1st Qu.: -1.4000
## Median : -1.0000 Median : 0.0000 Median : 0.5000 Median : 0.2000
## Mean : -0.7946 Mean : 0.0862 Mean : 0.5485 Mean : 0.5099
## 3rd Qu.: 1.8000 3rd Qu.: 1.4000 3rd Qu.: 2.2000 3rd Qu.: 2.1500
## Max. : 12.1000 Max. : 8.5000 Max. :11.9000 Max. : 16.4000
##
## wSF OSwingPct ZSwingPct SwingPct
## Min. :-3.50000 Min. :0.1440 Min. :0.5100 Min. :0.3430
## 1st Qu.:-0.60000 1st Qu.:0.2675 1st Qu.:0.6405 1st Qu.:0.4320
## Median :-0.10000 Median :0.3100 Median :0.6800 Median :0.4680
## Mean : 0.01048 Mean :0.3092 Mean :0.6789 Mean :0.4675
## 3rd Qu.: 0.60000 3rd Qu.:0.3510 3rd Qu.:0.7165 3rd Qu.:0.4995
## Max. : 5.00000 Max. :0.4840 Max. :0.8510 Max. :0.6110
## NA's :2
## OContactPct ZContactPct ContactPct ZonePct
## Min. :0.4180 Min. :0.7020 Min. :0.6100 Min. :0.374
## 1st Qu.:0.5705 1st Qu.:0.8275 1st Qu.:0.7320 1st Qu.:0.413
## Median :0.6340 Median :0.8630 Median :0.7730 Median :0.429
## Mean :0.6311 Mean :0.8592 Mean :0.7736 Mean :0.428
## 3rd Qu.:0.6910 3rd Qu.:0.8935 3rd Qu.:0.8145 3rd Qu.:0.442
## Max. :0.8360 Max. :0.9730 Max. :0.9110 Max. :0.483
##
## FStrikePct SwStrPct PullPct CentPct
## Min. :0.4910 Min. :0.0360 Min. :0.2550 Min. :0.2180
## 1st Qu.:0.5760 1st Qu.:0.0835 1st Qu.:0.3770 1st Qu.:0.3195
## Median :0.6030 Median :0.1050 Median :0.4080 Median :0.3410
## Mean :0.6027 Mean :0.1066 Mean :0.4091 Mean :0.3428
## 3rd Qu.:0.6260 3rd Qu.:0.1295 3rd Qu.:0.4480 3rd Qu.:0.3640
## Max. :0.6980 Max. :0.2380 Max. :0.5870 Max. :0.4490
##
## OppoPct SoftPct MedPct HardPct
## Min. :0.1590 Min. :0.0840 Min. :0.348 Min. :0.1910
## 1st Qu.:0.2170 1st Qu.:0.1535 1st Qu.:0.429 1st Qu.:0.3185
## Median :0.2460 Median :0.1780 Median :0.461 Median :0.3620
## Mean :0.2482 Mean :0.1768 Mean :0.463 Mean :0.3604
## 3rd Qu.:0.2750 3rd Qu.:0.1990 3rd Qu.:0.494 3rd Qu.:0.3995
## Max. :0.3620 Max. :0.3070 Max. :0.612 Max. :0.5090
##
## HR1 HR2
## Min. : 0.00 Min. : 0.00
## 1st Qu.:10.00 1st Qu.: 8.00
## Median :17.00 Median :15.00
## Mean :17.84 Mean :16.71
## 3rd Qu.:25.00 3rd Qu.:24.00
## Max. :59.00 Max. :47.00
## NA's :78 NA's :130
Everything looks lined up well to run correlations and some graphs, and all the information is as expected. We have some NAs for HR1 and HR2, but I will build a separate model for players without these numbers. One issue that I cannot fix within the scope of this project is projecting home runs for a potential rookie. That would require extra analysis of players’ rookie seasons, and I’d likely match them up in tiered groups based on prospect rankings from either MLB.com or Baseball America. I could also make a model without past homers, or I could run a player’s minor league stats through this, regress them to the mean, and then deduct a percentage for facing a lower level of competition. Minor league to major league equivalent stats actually DO exist! Anyway, let’s just tackle finding out which variables to use in this regression model so we can make our semi-decent home run prediction model.
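As a rough sketch of how the data could be split for the two models, the toy data frame below separates players with both prior-season totals from those missing at least one. The player names and numbers are invented for illustration, and the split rule (requiring both HR1 and HR2) is my assumption, not necessarily the one used in the regression document.

```r
# Toy data frame with the same column names as the real data set;
# all values are invented for illustration.
players <- data.frame(
  Name = c("A", "B", "C", "D"),
  HR   = c(35L, 22L, 10L, 27L),
  HR1  = c(NA, 20L, 8L, 30L),
  HR2  = c(NA, NA, 5L, 25L)
)

# TRUE only for rows where both prior-season totals are present
has_history <- complete.cases(players[, c("HR1", "HR2")])

with_history    <- players[has_history, ]   # candidates for the main model
without_history <- players[!has_history, ]  # candidates for the fallback model

print(with_history$Name)
print(without_history$Name)
```

A looser rule, such as requiring only HR1, would keep more players in the main group at the cost of dropping HR2 as a predictor.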
corHR1 <- cor(HomeRuns[,2:5])
corrplot.mixed(corHR1)
corHR2 <- cor(HomeRuns[,c(2,6:10)])
corrplot.mixed(corHR2)
corHR3 <- cor(HomeRuns[,c(2,11:15)])
corrplot(corHR3, method = "pie")
corHR4 <- cor(HomeRuns[,c(2,16:20)])
corrplot(corHR4, method = 'square')
corHR4v2 <- cor(HomeRuns[,c(2,16:20)])
corrplot.mixed(corHR4v2)
Wow, I can see why R is a great statistical program at this point. You can’t do this so quickly in Excel without spending a lot of time and effort, especially with no add-ins. I added home runs to each correlation plot because it is the dependent variable and I mostly want to measure everything against it. I know I can use interactions between two variables in a regression model, but I’d like to keep this model simpler, and I don’t think I really need them since I already have many useful ratios at my disposal. Age is an issue here because almost all projection systems account for it. As you get older you tend to decline, unless you’re Barry Bonds. I’ll have to figure out a way to incorporate this into the model since the correlation is -.09, or basically nothing. The big winners so far are plate appearances (.68), linear weighted fastball runs (.64), doubles (.58), on base percentage (.41), fly ball percentage (.40) and, to a certain extent, walk rate (.28). I was hoping for a higher line drive correlation since line drives have the highest batting average of any batted ball type, but it makes sense that more of them would fall in for hits than leave the playing field. I also like the negative effect of ground ball percentage, since unless it’s the rare inside-the-park homer you will not be hitting a home run on a ground ball. Another disappointment was that FanGraphs didn’t have MLB’s new Statcast data available, because Statcast tracks average fly ball distance and barrels, the top tier of well-hit batted balls, measured as a percentage. If I had to add anything to this model, those would likely be really useful, and it’s definitely something to think about adding in the future. For right now I still think we can build this off of FanGraphs data, and the Statcast numbers might even have been redundant with it.
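One way to make the "big winners" comparison concrete is to rank the variables by the absolute value of their correlation with HR. The sketch below hard-codes a handful of correlations, rounded from this document's correlation output, so that it runs without the data file. With the real data one could take the HR column of the correlation matrix instead.

```r
# Correlations with HR, rounded from the correlation output in this document
hr_cor <- c(
  Age = -0.0903, PA = 0.6793, Doubles = 0.5847, BBPct = 0.2782,
  OBP = 0.4125, GB_FB = -0.3520, LDPct = -0.1446, GBPct = -0.3239,
  FBPct = 0.4006, wFB = 0.6392
)

# Rank by strength of the relationship, ignoring the sign
ranked <- hr_cor[order(abs(hr_cor), decreasing = TRUE)]

print(round(ranked, 2))
```

Sorting on the absolute value keeps strongly negative predictors such as GB_FB near the top instead of burying them at the bottom of the list.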
corHR5 <- cor(HomeRuns[,c(2,21:25)])
corrplot.mixed(corHR5)
corHR6 <- cor(HomeRuns[,c(2,26:30)])
corrplot.mixed(corHR6)
corHR7 <- cor(HomeRuns[,c(2,31:36)])
corrplot.mixed(corHR7)
corHR8 <- cor(na.omit(HomeRuns[,c(2,37:38)]))
corrplot.mixed(corHR8)
A lot more “meh” here. There’s a semi-strong negative correlation between ZonePct and HR (-.47), which is odd but makes sense in a way because good hitters wouldn’t see a lot of strikes overall. Other than that, only pull and hard-hit percentages have a notable positive correlation. Medium contact has a stronger negative correlation (-.45) than soft contact (-.24). Home runs from 2017 and 2016 are highly correlated, which I will go over below. Most of these correlations are between -.5 and .5 because there’s a decent amount of data, and truly it’s hard to find one killer variable: while some guys hit the ball to their pull field for homers, other hitters have a different approach and use the opposite field as well. Former Tiger JD Martinez and current Tiger Miguel Cabrera are known for this. JD hit many home runs near me when I had season tickets in right field, despite being a right-handed hitter. At the other extreme are hitters like Joey Gallo: when he comes up, everybody shifts to the right side of the field, since he hits left-handed and will almost always pull the ball. Because of this, the regression model is a bit harder to make, and that’s why it’ll be a multiple regression.
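The corHR8 call above drops incomplete rows with na.omit() before computing correlations. When a correlation is computed over columns that contain NAs without any such handling (wSF, HR1 and HR2 here), cor() returns NA for every pair involving those columns. One alternative, shown here as a sketch on invented toy values, is the `use = "pairwise.complete.obs"` argument of cor(), which computes each pair from the rows where both values are present.

```r
# Toy data with missing values in two columns (invented numbers)
toy <- data.frame(
  HR  = c(35, 22, 22, 26, 3, 10),
  wSF = c(4.1, -0.9, 0.5, -0.2, 1.0, NA),
  HR1 = c(NA, 20, 15, NA, 5, 8)
)

# Default behaviour: any pair involving a column with an NA comes back NA
default_cor <- cor(toy)

# Pairwise deletion: each pair uses only the rows where both are observed
pairwise_cor <- cor(toy, use = "pairwise.complete.obs")

print(default_cor)
print(round(pairwise_cor, 2))
```

The trade-off is that each cell of the pairwise matrix can be based on a different set of players, so the values are not strictly comparable with one another. With 78 and 130 missing values in HR1 and HR2, that difference could matter.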
corHRMatrix <- cor(HomeRuns[,c(2:38)])
corHRMatrix
## HR Age PA Doubles
## HR 1.00000000 -0.090299445 0.6792503006 0.584693228
## Age -0.09029944 1.000000000 -0.0397789691 -0.095761174
## PA 0.67925030 -0.039778969 1.0000000000 0.835024546
## Doubles 0.58469323 -0.095761174 0.8350245455 1.000000000
## BBPct 0.27816333 0.046958324 0.1588074021 0.101973610
## KPct 0.04049126 -0.190044686 -0.2928936593 -0.311309323
## BB_K 0.18908870 0.147434367 0.3056301292 0.287728584
## OBP 0.41247124 -0.018245456 0.4321573710 0.467514825
## BABIP 0.03970784 -0.206690961 0.1846870810 0.287115482
## GB_FB -0.35196868 -0.069689964 -0.0701439952 -0.160348092
## LDPct -0.14456899 0.133322517 0.0638513646 0.151910600
## GBPct -0.32387290 -0.112792398 -0.0919772857 -0.192921052
## FBPct 0.40055732 0.048617464 0.0617214386 0.120855547
## HR_FB 0.73868672 -0.128780172 0.1980686668 0.132992884
## wFB 0.63918558 -0.090182610 0.4778438482 0.513651108
## wSL 0.30023683 -0.090069487 0.1813668111 0.289053427
## wCT 0.28691416 -0.017491911 0.1166053037 0.148138674
## wCB 0.32539861 -0.037472126 0.3095655582 0.402145790
## wCH 0.35911480 -0.045297853 0.2819706030 0.343760113
## wSF NA NA NA NA
## OSwingPct 0.02271830 -0.134566267 0.0004340958 0.033283315
## ZSwingPct 0.14228903 -0.129429038 0.0530311210 0.078024478
## SwingPct 0.01455212 -0.148569701 -0.0032899893 0.031040370
## OContactPct -0.08444999 0.136280545 0.2448017773 0.261809784
## ZContactPct -0.15038268 0.158778484 0.1612281754 0.188456345
## ContactPct -0.15630029 0.175901601 0.2011620072 0.221898877
## ZonePct -0.46927650 0.045850821 -0.1930785642 -0.196194487
## FStrikePct -0.21969017 -0.147931240 -0.1340779679 -0.085383820
## SwStrPct 0.13706832 -0.209896180 -0.1726690164 -0.174954281
## PullPct 0.28241832 0.028079302 0.0186140567 0.075579889
## CentPct -0.09783550 -0.049289224 0.0423617962 -0.007386045
## OppoPct -0.29573663 0.003471035 -0.0593264592 -0.094430399
## SoftPct -0.24351909 -0.141208421 -0.1582199830 -0.232756222
## MedPct -0.45121188 -0.009413684 -0.1210374428 -0.131044339
## HardPct 0.50831560 0.093370445 0.1931767672 0.246863056
## HR1 NA NA NA NA
## HR2 NA NA NA NA
## BBPct KPct BB_K OBP BABIP
## HR 0.27816333 0.040491256 0.18908870 0.41247124 0.03970784
## Age 0.04695832 -0.190044686 0.14743437 -0.01824546 -0.20669096
## PA 0.15880740 -0.292893659 0.30563013 0.43215737 0.18468708
## Doubles 0.10197361 -0.311309323 0.28772858 0.46751482 0.28711548
## BBPct 1.00000000 0.116753568 0.70463116 0.62778926 0.01490670
## KPct 0.11675357 1.000000000 -0.54837498 -0.27094953 0.12282115
## BB_K 0.70463116 -0.548374980 1.00000000 0.67240948 -0.05064999
## OBP 0.62778926 -0.270949528 0.67240948 1.00000000 0.57915552
## BABIP 0.01490670 0.122821146 -0.05064999 0.57915552 1.00000000
## GB_FB -0.13458652 -0.106730783 -0.05835622 0.03680014 0.29796570
## LDPct -0.04268523 -0.122929201 0.05363512 0.23344362 0.40779958
## GBPct -0.13876791 -0.092866223 -0.07381997 -0.03574927 0.18987760
## FBPct 0.16134957 0.155393565 0.04763281 -0.07880367 -0.39335577
## HR_FB 0.29540723 0.406246894 -0.03569535 0.31610976 0.13183381
## wFB 0.52460255 -0.053550544 0.46200638 0.76511435 0.40905236
## wSL 0.18108407 -0.148023481 0.25508069 0.44230436 0.27201567
## wCT 0.09226594 -0.008199828 0.06957319 0.23412520 0.09817940
## wCB 0.19450794 -0.209977854 0.31100888 0.45730273 0.26820397
## wCH 0.12243966 -0.187130408 0.23674788 0.38128847 0.20681404
## wSF NA NA NA NA NA
## OSwingPct -0.75840468 0.006596864 -0.61799719 -0.42863873 0.03138310
## ZSwingPct -0.34396678 0.095640721 -0.32498490 -0.15731959 0.08146441
## SwingPct -0.70474973 0.031724742 -0.58036251 -0.39404407 0.05862901
## OContactPct -0.13378822 -0.834076818 0.45789371 0.17932170 -0.10309002
## ZContactPct -0.15045233 -0.822834035 0.41157237 0.16069489 -0.10243559
## ContactPct -0.04796874 -0.878566886 0.54239225 0.23207269 -0.10947565
## ZonePct -0.08642299 -0.134980969 0.03426291 -0.13444553 -0.01141790
## FStrikePct -0.54643944 0.109559316 -0.50351270 -0.32169559 0.18188078
## SwStrPct -0.21086250 0.760045825 -0.65613645 -0.33057946 0.12645878
## PullPct 0.16324537 0.099654553 0.06701096 -0.04007681 -0.35145423
## CentPct -0.07876021 -0.022823135 -0.03963191 0.03675304 0.23587764
## OppoPct -0.15264318 -0.114427697 -0.05647888 0.02255063 0.27252854
## SoftPct -0.23849242 -0.115980782 -0.13264252 -0.32953666 -0.26101676
## MedPct -0.18400250 -0.249481993 0.03271656 -0.15278412 -0.02787313
## HardPct 0.29172264 0.269391999 0.05466916 0.32305639 0.18186881
## HR1 NA NA NA NA NA
## HR2 NA NA NA NA NA
## GB_FB LDPct GBPct FBPct
## HR -0.3519686760 -0.1445689854 -0.32387290 0.4005573202
## Age -0.0696899642 0.1333225167 -0.11279240 0.0486174645
## PA -0.0701439952 0.0638513646 -0.09197729 0.0617214386
## Doubles -0.1603480916 0.1519105995 -0.19292105 0.1208555468
## BBPct -0.1345865176 -0.0426852301 -0.13876791 0.1613495708
## KPct -0.1067307831 -0.1229292012 -0.09286622 0.1553935651
## BB_K -0.0583562197 0.0536351168 -0.07381997 0.0476328129
## OBP 0.0368001360 0.2334436186 -0.03574927 -0.0788036738
## BABIP 0.2979656966 0.4077995779 0.18987760 -0.3933557650
## GB_FB 1.0000000000 -0.0003043942 0.91993912 -0.9337892559
## LDPct -0.0003043942 1.0000000000 -0.27359435 -0.2151016007
## GBPct 0.9199391187 -0.2735943508 1.00000000 -0.8804543725
## FBPct -0.9337892559 -0.2151016007 -0.88045437 1.0000000000
## HR_FB -0.1261463115 -0.1907703963 -0.07235716 0.1683220364
## wFB -0.1133450212 0.1054214490 -0.15180723 0.1022818759
## wSL -0.0124680358 0.0996687841 -0.04856734 0.0006646876
## wCT -0.1298227000 -0.0307123516 -0.12960590 0.1477242612
## wCB -0.0516241043 0.1296920063 -0.05790534 -0.0055944335
## wCH -0.1138891794 0.0355924916 -0.12374460 0.1077281791
## wSF NA NA NA NA
## OSwingPct 0.0124566464 0.0058259051 0.01879645 -0.0215059395
## ZSwingPct -0.1227552317 0.0172294462 -0.12282024 0.1168806165
## SwingPct -0.0194849770 0.0452680924 -0.02405124 0.0026697962
## OContactPct 0.0803912600 0.1951852923 0.03505043 -0.1324669604
## ZContactPct 0.1361508489 0.1772877664 0.08555952 -0.1747404968
## ContactPct 0.1194508621 0.2144467930 0.06240218 -0.1697506856
## ZonePct 0.2173909053 0.2185186206 0.15322492 -0.2637705936
## FStrikePct 0.1525617107 0.1267104716 0.13328799 -0.1978260306
## SwStrPct -0.1017911659 -0.1636570078 -0.05681011 0.1391649295
## PullPct -0.5067682592 -0.2226650346 -0.43349559 0.5500515369
## CentPct 0.3679533936 0.0049051843 0.38512585 -0.3936000738
## OppoPct 0.3715057160 0.2917498759 0.26008550 -0.4079633056
## SoftPct 0.1067646046 -0.3343315175 0.19520366 -0.0334067017
## MedPct 0.1869774060 0.1728778312 0.16017490 -0.2479710865
## HardPct -0.2148279887 0.0669851896 -0.24762033 0.2185057154
## HR1 NA NA NA NA
## HR2 NA NA NA NA
## HR_FB wFB wSL wCT
## HR 0.7386867242 0.63918558 0.3002368323 0.286914156
## Age -0.1287801724 -0.09018261 -0.0900694867 -0.017491911
## PA 0.1980686668 0.47784385 0.1813668111 0.116605304
## Doubles 0.1329928843 0.51365111 0.2890534268 0.148138674
## BBPct 0.2954072288 0.52460255 0.1810840709 0.092265936
## KPct 0.4062468941 -0.05355054 -0.1480234809 -0.008199828
## BB_K -0.0356953517 0.46200638 0.2550806883 0.069573190
## OBP 0.3161097634 0.76511435 0.4423043632 0.234125201
## BABIP 0.1318338142 0.40905236 0.2720156712 0.098179399
## GB_FB -0.1261463115 -0.11334502 -0.0124680358 -0.129822700
## LDPct -0.1907703963 0.10542145 0.0996687841 -0.030712352
## GBPct -0.0723571574 -0.15180723 -0.0485673406 -0.129605903
## FBPct 0.1683220364 0.10228188 0.0006646876 0.147724261
## HR_FB 1.0000000000 0.51116753 0.2639483859 0.234172474
## wFB 0.5111675298 1.00000000 0.2394326503 0.144922210
## wSL 0.2639483859 0.23943265 1.0000000000 0.092205408
## wCT 0.2341724735 0.14492221 0.0922054082 1.000000000
## wCB 0.2167180377 0.29520463 0.3085986516 0.102155709
## wCH 0.1922032451 0.26352985 0.1737058840 0.169048250
## wSF NA NA NA NA
## OSwingPct 0.0107050718 -0.25398601 -0.1298770019 -0.043346315
## ZSwingPct 0.1309564234 -0.06494049 0.0471104613 0.086105475
## SwingPct 0.0007563243 -0.25152059 -0.0874160654 0.005088544
## OContactPct -0.4155650298 -0.00634712 0.1125594328 -0.007233188
## ZContactPct -0.4571648503 -0.04960880 0.0661154335 -0.007179248
## ContactPct -0.4918536865 -0.01157864 0.1109990610 0.002086832
## ZonePct -0.4731224484 -0.29106451 -0.1262969942 -0.055578529
## FStrikePct -0.1434723497 -0.28450281 -0.0813124868 0.020558849
## SwStrPct 0.4173163509 -0.07018229 -0.1190874504 0.003222963
## PullPct 0.2193148616 0.08990250 0.0351485606 0.113862998
## CentPct -0.0333339038 -0.01829294 -0.0242632085 -0.017897043
## OppoPct -0.2649926826 -0.10571829 -0.0269560651 -0.135982430
## SoftPct -0.2522301764 -0.34566584 -0.1440752593 -0.098518572
## MedPct -0.5112483360 -0.25240675 -0.2323547226 -0.106019137
## HardPct 0.5612258398 0.41249309 0.2731754590 0.144834413
## HR1 NA NA NA NA
## HR2 NA NA NA NA
## wCB wCH wSF OSwingPct ZSwingPct
## HR 0.325398614 0.359114802 NA 0.0227183025 0.142289027
## Age -0.037472126 -0.045297853 NA -0.1345662674 -0.129429038
## PA 0.309565558 0.281970603 NA 0.0004340958 0.053031121
## Doubles 0.402145790 0.343760113 NA 0.0332833150 0.078024478
## BBPct 0.194507937 0.122439659 NA -0.7584046828 -0.343966784
## KPct -0.209977854 -0.187130408 NA 0.0065968635 0.095640721
## BB_K 0.311008879 0.236747881 NA -0.6179971902 -0.324984902
## OBP 0.457302734 0.381288466 NA -0.4286387277 -0.157319587
## BABIP 0.268203965 0.206814042 NA 0.0313830952 0.081464412
## GB_FB -0.051624104 -0.113889179 NA 0.0124566464 -0.122755232
## LDPct 0.129692006 0.035592492 NA 0.0058259051 0.017229446
## GBPct -0.057905345 -0.123744595 NA 0.0187964453 -0.122820236
## FBPct -0.005594433 0.107728179 NA -0.0215059395 0.116880617
## HR_FB 0.216718038 0.192203245 NA 0.0107050718 0.130956423
## wFB 0.295204627 0.263529852 NA -0.2539860109 -0.064940488
## wSL 0.308598652 0.173705884 NA -0.1298770019 0.047110461
## wCT 0.102155709 0.169048250 NA -0.0433463150 0.086105475
## wCB 1.000000000 0.159257943 NA -0.1003841480 0.044316237
## wCH 0.159257943 1.000000000 NA -0.0630303334 -0.033961531
## wSF NA NA 1 NA NA
## OSwingPct -0.100384148 -0.063030333 NA 1.0000000000 0.570731355
## ZSwingPct 0.044316237 -0.033961531 NA 0.5707313553 1.000000000
## SwingPct -0.056137909 -0.076251715 NA 0.9133532788 0.835098735
## OContactPct 0.174630545 0.203181155 NA -0.0001205896 -0.227267222
## ZContactPct 0.106715841 0.184121828 NA -0.0662284193 -0.319200707
## ContactPct 0.164669555 0.200341757 NA -0.1936958173 -0.337996931
## ZonePct -0.056045293 -0.099945805 NA -0.3681043660 -0.337803027
## FStrikePct -0.067711094 -0.150868015 NA 0.5125849177 0.386497764
## SwStrPct -0.155630966 -0.200726521 NA 0.4834862181 0.577177991
## PullPct 0.034609794 0.007303391 NA -0.0846409962 0.008440565
## CentPct -0.041503161 0.007441391 NA 0.0474219273 -0.020530538
## OppoPct -0.012968316 -0.015431467 NA 0.0728791191 0.004313378
## SoftPct -0.099836175 -0.151159641 NA 0.1818296773 0.018954245
## MedPct -0.152511374 -0.118252104 NA 0.0017208815 -0.081453268
## HardPct 0.182304371 0.186229199 NA -0.1122019349 0.054000561
## HR1 NA NA NA NA NA
## HR2 NA NA NA NA NA
## SwingPct OContactPct ZContactPct ContactPct
## HR 0.0145521214 -0.0844499938 -0.150382678 -0.1563002879
## Age -0.1485697005 0.1362805453 0.158778484 0.1759016013
## PA -0.0032899893 0.2448017773 0.161228175 0.2011620072
## Doubles 0.0310403702 0.2618097844 0.188456345 0.2218988774
## BBPct -0.7047497291 -0.1337882239 -0.150452326 -0.0479687407
## KPct 0.0317247418 -0.8340768177 -0.822834035 -0.8785668856
## BB_K -0.5803625081 0.4578937070 0.411572373 0.5423922549
## OBP -0.3940440653 0.1793216996 0.160694891 0.2320726873
## BABIP 0.0586290130 -0.1030900221 -0.102435591 -0.1094756461
## GB_FB -0.0194849770 0.0803912600 0.136150849 0.1194508621
## LDPct 0.0452680924 0.1951852923 0.177287766 0.2144467930
## GBPct -0.0240512446 0.0350504313 0.085559516 0.0624021796
## FBPct 0.0026697962 -0.1324669604 -0.174740497 -0.1697506856
## HR_FB 0.0007563243 -0.4155650298 -0.457164850 -0.4918536865
## wFB -0.2515205869 -0.0063471202 -0.049608796 -0.0115786393
## wSL -0.0874160654 0.1125594328 0.066115434 0.1109990610
## wCT 0.0050885440 -0.0072331879 -0.007179248 0.0020868317
## wCB -0.0561379091 0.1746305446 0.106715841 0.1646695548
## wCH -0.0762517149 0.2031811549 0.184121828 0.2003417570
## wSF NA NA NA NA
## OSwingPct 0.9133532788 -0.0001205896 -0.066228419 -0.1936958173
## ZSwingPct 0.8350987348 -0.2272672218 -0.319200707 -0.3379969308
## SwingPct 1.0000000000 -0.0809975314 -0.165283894 -0.2458807583
## OContactPct -0.0809975314 1.0000000000 0.722964542 0.9135155999
## ZContactPct -0.1652838942 0.7229645419 1.000000000 0.9089174379
## ContactPct -0.2458807583 0.9135155999 0.908917438 1.0000000000
## ZonePct -0.2687880782 0.2072486471 0.250178433 0.3575362118
## FStrikePct 0.5609406142 -0.1106820224 -0.106444210 -0.1652682671
## SwStrPct 0.5589593810 -0.8058474304 -0.827716206 -0.9356948442
## PullPct -0.0818921086 -0.1278886546 -0.133970200 -0.1358170532
## CentPct 0.0290314912 -0.0117384871 0.021346534 -0.0005122881
## OppoPct 0.0841144115 0.1802365973 0.161830479 0.1820991093
## SoftPct 0.1406332107 0.1054229332 0.082505919 0.0718604265
## MedPct 0.0125934349 0.2838620859 0.302412561 0.3282332809
## HardPct -0.0953584615 -0.2903799000 -0.291338610 -0.3052859633
## HR1 NA NA NA NA
## HR2 NA NA NA NA
## ZonePct FStrikePct SwStrPct PullPct
## HR -0.46927650 -0.21969017 0.137068322 0.282418324
## Age 0.04585082 -0.14793124 -0.209896180 0.028079302
## PA -0.19307856 -0.13407797 -0.172669016 0.018614057
## Doubles -0.19619449 -0.08538382 -0.174954281 0.075579889
## BBPct -0.08642299 -0.54643944 -0.210862498 0.163245370
## KPct -0.13498097 0.10955932 0.760045825 0.099654553
## BB_K 0.03426291 -0.50351270 -0.656136445 0.067010960
## OBP -0.13444553 -0.32169559 -0.330579464 -0.040076807
## BABIP -0.01141790 0.18188078 0.126458775 -0.351454232
## GB_FB 0.21739091 0.15256171 -0.101791166 -0.506768259
## LDPct 0.21851862 0.12671047 -0.163657008 -0.222665035
## GBPct 0.15322492 0.13328799 -0.056810105 -0.433495588
## FBPct -0.26377059 -0.19782603 0.139164929 0.550051537
## HR_FB -0.47312245 -0.14347235 0.417316351 0.219314862
## wFB -0.29106451 -0.28450281 -0.070182287 0.089902497
## wSL -0.12629699 -0.08131249 -0.119087450 0.035148561
## wCT -0.05557853 0.02055885 0.003222963 0.113862998
## wCB -0.05604529 -0.06771109 -0.155630966 0.034609794
## wCH -0.09994581 -0.15086801 -0.200726521 0.007303391
## wSF NA NA NA NA
## OSwingPct -0.36810437 0.51258492 0.483486218 -0.084640996
## ZSwingPct -0.33780303 0.38649776 0.577177991 0.008440565
## SwingPct -0.26878808 0.56094061 0.558959381 -0.081892109
## OContactPct 0.20724865 -0.11068202 -0.805847430 -0.127888655
## ZContactPct 0.25017843 -0.10644421 -0.827716206 -0.133970200
## ContactPct 0.35753621 -0.16526827 -0.935694844 -0.135817053
## ZonePct 1.00000000 0.11716484 -0.385315235 -0.181998686
## FStrikePct 0.11716484 1.00000000 0.340808048 -0.159518397
## SwStrPct -0.38531523 0.34080805 1.000000000 0.075705980
## PullPct -0.18199869 -0.15951840 0.075705980 1.000000000
## CentPct 0.04532456 0.07139487 0.022704421 -0.661193613
## OppoPct 0.20628607 0.15306128 -0.120603901 -0.788003499
## SoftPct 0.04696737 0.12579692 -0.009844386 -0.026838058
## MedPct 0.34335602 0.09170947 -0.265091060 -0.230908482
## HardPct -0.30177625 -0.14967654 0.217320532 0.200570593
## HR1 NA NA NA NA
## HR2 NA NA NA NA
## CentPct OppoPct SoftPct MedPct
## HR -0.0978354984 -0.295736629 -0.243519089 -0.451211879
## Age -0.0492892245 0.003471035 -0.141208421 -0.009413684
## PA 0.0423617962 -0.059326459 -0.158219983 -0.121037443
## Doubles -0.0073860446 -0.094430399 -0.232756222 -0.131044339
## BBPct -0.0787602115 -0.152643178 -0.238492417 -0.184002498
## KPct -0.0228231348 -0.114427697 -0.115980782 -0.249481993
## BB_K -0.0396319090 -0.056478884 -0.132642521 0.032716560
## OBP 0.0367530351 0.022550630 -0.329536658 -0.152784115
## BABIP 0.2358776352 0.272528536 -0.261016764 -0.027873125
## GB_FB 0.3679533936 0.371505716 0.106764605 0.186977406
## LDPct 0.0049051843 0.291749876 -0.334331518 0.172877831
## GBPct 0.3851258508 0.260085504 0.195203663 0.160174902
## FBPct -0.3936000738 -0.407963306 -0.033406702 -0.247971087
## HR_FB -0.0333339038 -0.264992683 -0.252230176 -0.511248336
## wFB -0.0182929407 -0.105718290 -0.345665844 -0.252406747
## wSL -0.0242632085 -0.026956065 -0.144075259 -0.232354723
## wCT -0.0178970435 -0.135982430 -0.098518572 -0.106019137
## wCB -0.0415031613 -0.012968316 -0.099836175 -0.152511374
## wCH 0.0074413906 -0.015431467 -0.151159641 -0.118252104
## wSF NA NA NA NA
## OSwingPct 0.0474219273 0.072879119 0.181829677 0.001720882
## ZSwingPct -0.0205305378 0.004313378 0.018954245 -0.081453268
## SwingPct 0.0290314912 0.084114412 0.140633211 0.012593435
## OContactPct -0.0117384871 0.180236597 0.105422933 0.283862086
## ZContactPct 0.0213465338 0.161830479 0.082505919 0.302412561
## ContactPct -0.0005122881 0.182099109 0.071860427 0.328233281
## ZonePct 0.0453245614 0.206286065 0.046967373 0.343356019
## FStrikePct 0.0713948690 0.153061281 0.125796924 0.091709467
## SwStrPct 0.0227044207 -0.120603901 -0.009844386 -0.265091060
## PullPct -0.6611936132 -0.788003499 -0.026838058 -0.230908482
## CentPct 1.0000000000 0.059222688 -0.008133611 0.036553617
## OppoPct 0.0592226880 1.000000000 0.042802076 0.278599212
## SoftPct -0.0081336113 0.042802076 1.000000000 -0.008202781
## MedPct 0.0365536171 0.278599212 -0.008202781 1.000000000
## HardPct -0.0241122719 -0.248419212 -0.604347852 -0.791692804
## HR1 NA NA NA NA
## HR2 NA NA NA NA
## HardPct HR1 HR2
## HR 0.50831560 NA NA
## Age 0.09337045 NA NA
## PA 0.19317677 NA NA
## Doubles 0.24686306 NA NA
## BBPct 0.29172264 NA NA
## KPct 0.26939200 NA NA
## BB_K 0.05466916 NA NA
## OBP 0.32305639 NA NA
## BABIP 0.18186881 NA NA
## GB_FB -0.21482799 NA NA
## LDPct 0.06698519 NA NA
## GBPct -0.24762033 NA NA
## FBPct 0.21850572 NA NA
## HR_FB 0.56122584 NA NA
## wFB 0.41249309 NA NA
## wSL 0.27317546 NA NA
## wCT 0.14483441 NA NA
## wCB 0.18230437 NA NA
## wCH 0.18622920 NA NA
## wSF NA NA NA
## OSwingPct -0.11220193 NA NA
## ZSwingPct 0.05400056 NA NA
## SwingPct -0.09535846 NA NA
## OContactPct -0.29037990 NA NA
## ZContactPct -0.29133861 NA NA
## ContactPct -0.30528596 NA NA
## ZonePct -0.30177625 NA NA
## FStrikePct -0.14967654 NA NA
## SwStrPct 0.21732053 NA NA
## PullPct 0.20057059 NA NA
## CentPct -0.02411227 NA NA
## OppoPct -0.24841921 NA NA
## SoftPct -0.60434785 NA NA
## MedPct -0.79169280 NA NA
## HardPct 1.00000000 NA NA
## HR1 NA 1 NA
## HR2 NA NA 1
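Here’s the full correlation matrix, which is just one big table covering everything I just went through. It’s more for reference than for going over results, since I already highlighted the big takeaways. The HR1, HR2 and wSF rows and columns show NA because those columns contain missing values.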
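The NA entries for HR1, HR2 and wSF come from the missing values in those columns. A minimal sketch of one way around that, assuming pairwise deletion is acceptable for a quick look: the `use = "pairwise.complete.obs"` argument of `cor()` computes each correlation from the rows where both columns are present. The `toy` data frame and its values are made up for illustration and are not the real data.

```r
# Toy stand-in for the HomeRuns data frame; the values are invented.
toy <- data.frame(
  HR      = c(35, 22, 10, 27, 16, 24),
  HR1     = c(30, NA, 12, 25, NA, 20),
  HardPct = c(0.45, 0.38, 0.28, 0.41, 0.33, 0.39)
)

# Default behaviour: any correlation involving a column with NAs comes back NA.
cor_default <- cor(toy)

# Pairwise deletion: each pair of columns uses only the rows where both are present.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
round(cor_pairwise, 2)
```

Keep in mind that with pairwise deletion each cell can be based on a different set of players, so the numbers are good for exploring but not strictly comparable to each other.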
ggplot(HomeRuns,aes(x=HR,y=HR1)) + geom_point() + geom_smooth(method = 'lm')
## Warning: Removed 78 rows containing non-finite values (stat_smooth).
## Warning: Removed 78 rows containing missing values (geom_point).
ggplot(HomeRuns,aes(x=HR,y=HR2)) + geom_point() + geom_smooth(method = 'lm')
## Warning: Removed 130 rows containing non-finite values (stat_smooth).
## Warning: Removed 130 rows containing missing values (geom_point).
Tom Tango, a famed baseball statistician, came up with a projection system called Marcel, which I’m pretty sure is named after the monkey from “Friends.” People bugged him to make a projection system since he does so much good statistical baseball research, but Tango just wasn’t interested in doing this, so he quickly made a very simple system that takes a player’s last three home run totals, an age factor, and regression to the league mean. I implemented the first two ideas in this model, but since it’s a regression model I will add other variables rather than regressing to the league mean. That is something I could add to this model one day. As you can see in the two plots above, a player’s last two seasons of home run totals are highly correlated with his current total.
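To make the Marcel idea concrete, here is a rough sketch of the three ingredients described above: a weighted average of past home run totals, shrinkage toward a league mean, and an age adjustment around 29. This is not Tango’s actual formula. The function name `marcel_sketch` and every number in it (the weights, the shrink fraction, the league mean and the age step) are assumptions I picked for illustration.

```r
# Rough Marcel-style sketch; all default values are illustrative assumptions.
marcel_sketch <- function(hr, age,
                          w = c(5, 4, 3),     # assumed weights, most recent season first
                          league_hr = 15,     # assumed league-average HR total
                          shrink = 0.3,       # assumed share pulled toward the league mean
                          peak_age = 29,      # prime age used in this write-up
                          age_step = 0.005) { # assumed adjustment per year away from peak
  weighted  <- sum(w * hr) / sum(w)                         # weighted recent performance
  regressed <- (1 - shrink) * weighted + shrink * league_hr # regression to the mean
  regressed * (1 + age_step * (peak_age - age))             # boost if young, penalty if old
}

# A 26-year-old who hit 30, 24 and 18 homers in his last three seasons.
projection <- marcel_sketch(hr = c(30, 24, 18), age = 26)
projection
```

The weighted average here is 25, shrinking gives 22, and the age bump for being three years under 29 nudges it up slightly.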
ggplot(HomeRuns,aes(x=HR, fill=Age)) + geom_histogram(binwidth = 2) + facet_wrap(~Age)
Okay, here I am just trying to see if there are any patterns in home runs by age in the faceted histogram, but this is just a bunch of randomness.
ggplot(HomeRuns) + geom_boxplot(aes(x = Age, y = HR)) + facet_wrap(~Age)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
We can see that the median (the line in each box) typically grows or stays the same from ages 24-30, then slowly starts to go down.
HomeRuns2 <- HomeRuns$HR >= 20
HomeRuns3 <- HomeRuns[HomeRuns2,]
Let’s try taking just the guys who hit at least 20 homers and see if power hitters show any more patterns.
ggplot(HomeRuns3,aes(x=HR, fill=Age)) + geom_histogram(binwidth = 2) + facet_wrap(~Age)
Yeah there’s not much more here as far as patterns go.
HomeRuns$Age2 <- 29 - HomeRuns$Age
HomeRuns$HRDiff <- HomeRuns$HR - HomeRuns$HR1
ggplot(HomeRuns,aes(x=HRDiff,y=Age2)) + geom_point() +geom_smooth(method = 'lm')
## Warning: Removed 78 rows containing non-finite values (stat_smooth).
## Warning: Removed 78 rows containing missing values (geom_point).
So now I’m designating 29 as my prime age. Typically guys have their best seasons between 28 and 32; Miguel Cabrera won his MVPs in his age 29 and 30 seasons. Tom Tango also uses 29 as the peak age in his projection system. If a guy is under 29, he gets credit for getting better when projecting next season’s home run total, and if he’s over 29, then he is expected to start aging and get worse. I also created an HR difference column (HRDiff) to measure this: it is the difference in home runs hit by a player from 2017 to 2018, so a player who hit fewer homers in 2018 than in 2017 has a negative value.
OldAge <- HomeRuns$Age >= 29
HomeRuns4 <- HomeRuns[OldAge,]
ggplot(HomeRuns4,aes(x=HRDiff, fill=Age)) + geom_histogram(binwidth = 2)
## Warning: Removed 9 rows containing non-finite values (stat_bin).
ggplot(HomeRuns4,aes(x=HR1,y=HR)) + geom_point() + geom_smooth(method = 'lm')
## Warning: Removed 9 rows containing non-finite values (stat_smooth).
## Warning: Removed 9 rows containing missing values (geom_point).
ggplot(HomeRuns4) + geom_boxplot(aes(x = Age, y = HR)) + facet_wrap(~Age)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
That histogram looks like roughly what I’m looking for! There are more guys on the negative side of the graph. Looks like being 30 really does mean you’re getting old. Now we just have to check the mean and median for the young players and hope the drop is smaller there, or even that the number is positive. Because of variance, I’m looking for a number a few homers below 0 for the older group and around 0 for the young guys. We can also see the pattern of the median typically going down in the boxplot.
mean(na.omit(HomeRuns4$HRDiff))
## [1] -4
median(na.omit(HomeRuns4$HRDiff))
## [1] -4
-4 is both the mean and the median, which means the typical player aged 29 or older lost 4 homers from 2017 to 2018. That supports my theory.
YoungAge <- HomeRuns$Age <= 29
HomeRuns5 <- HomeRuns[YoungAge,]
ggplot(HomeRuns5,aes(x=HRDiff, fill=Age)) + geom_histogram(binwidth = 2)
## Warning: Removed 72 rows containing non-finite values (stat_bin).
ggplot(HomeRuns5,aes(x=HR1,y=HR)) + geom_point() +geom_smooth(method = 'lm')
## Warning: Removed 72 rows containing non-finite values (stat_smooth).
## Warning: Removed 72 rows containing missing values (geom_point).
ggplot(HomeRuns5) + geom_boxplot(aes(x = Age, y = HR)) + facet_wrap(~Age)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
mean(na.omit(HomeRuns5$HRDiff))
## [1] -0.9817073
median(na.omit(HomeRuns5$HRDiff))
## [1] -1
Above are the young players. The mean and median are roughly -1, which is pretty much 0, especially considering guys who may skew the data due to injuries. Note that 29-year-olds land in both groups, since the filters are Age >= 29 and Age <= 29. The Age2 variable will penalize older players in my regression model, which is basically what I am looking for it to do. It’s not perfect, because I’d rather have the mean and median for this group be positive, but it will work for my model. The median tends to rise on the boxplots here too, except for the pesky 26 and 27 year olds. Seems like Tango was right on his cutoff point, though.
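The two filters above both include age 29 (Age >= 29 and Age <= 29), so those players are counted twice. Here is a small sketch of the same comparison done in one pass with non-overlapping groups. The `toy` data frame is invented, and putting 29-year-olds in the older group is my assumption, not something settled above.

```r
# Invented example rows; HR1 is last season's total and may be missing.
toy <- data.frame(
  Age = c(24, 27, 29, 31, 34, 26),
  HR  = c(20, 25, 30, 18, 10, 15),
  HR1 = c(15, 26, 33, 24, 15, NA)
)
toy$HRDiff <- toy$HR - toy$HR1

# Assumed split: under 29 versus 29 and over, so nobody is counted twice.
toy$AgeGroup <- ifelse(toy$Age < 29, "under 29", "29 and over")

# Mean and median change in home runs per group, ignoring missing values.
group_mean   <- tapply(toy$HRDiff, toy$AgeGroup, mean,   na.rm = TRUE)
group_median <- tapply(toy$HRDiff, toy$AgeGroup, median, na.rm = TRUE)
group_mean
group_median
```

On the real data this would replace the separate HomeRuns4 and HomeRuns5 summaries with one table that is easier to compare.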
ggplot(HomeRuns,aes(x=HR,y=PA)) + geom_point() + geom_smooth(method = 'lm')
Now I simply made a scatter plot of each of the other variables against home runs to see the relationships graphically. I know this is what my Forecasting professor, Professor Roumani, would want me to do. We can see plate appearances line up really well with home runs, almost as well as previous seasons’ home run totals. This makes sense because you need to get up to bat in order to hit home runs, and you can’t do that from the bench. There’s a trend here: more PAs, more longballs.
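Since there are a lot of scatter plots coming, a numeric shortcut can help decide which ones deserve a close look. This is a small sketch, on simulated data, of ranking every other column by the size of its correlation with HR. The column names mirror the real ones, but the values and the simulated relationships are made up.

```r
set.seed(1)
n <- 100

# Simulated stand-in for the real data; the relationships are invented.
toy <- data.frame(PA = round(runif(n, 200, 700)))
toy$HR    <- round(0.04 * toy$PA + rnorm(n, sd = 4))
toy$BABIP <- rnorm(n, mean = 0.300, sd = 0.03)
toy$HR1   <- toy$HR + round(rnorm(n, sd = 5))
toy$HR1[sample(n, 20)] <- NA   # some players have no previous season

# Correlation of HR with each other column, skipping rows with missing values.
predictors <- setdiff(names(toy), "HR")
hr_cor <- sapply(predictors, function(v) {
  cor(toy$HR, toy[[v]], use = "complete.obs")
})

# Strongest relationships first, regardless of sign.
ranked <- sort(abs(hr_cor), decreasing = TRUE)
ranked
```

The plots are still worth making, because a correlation can hide a curved relationship or a few outliers.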
ggplot(HomeRuns,aes(x=HR,y=wFB)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=wCB)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=wSL)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=wCH)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=wCT)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=wSF)) + geom_point() + geom_smooth(method = 'lm')
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
Well, it turns out being able to hit a fastball well really helps you hit home runs in MLB. It is the fastest and most common pitch, so hitting it well leads to homers. Every other pitch has a slight positive slope, which makes sense: if you hit a changeup really well, you will likely hit more home runs. Overall, though, the other pitches are not very good indicators of home runs. Since I’d expect a positive slope anyway, I’d need to see a stronger relationship before putting them in my model. I might still make a model based only on how well a player hits each pitch and see which pitches are actually significant.
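The pitch-only model mentioned above could look something like this sketch: regress HR on the pitch values and read the p-values off the coefficient table. The data here is simulated so that only the fastball value matters, which is an assumption built in to show what the output looks like, not a result from the real data.

```r
set.seed(42)
n <- 200

# Simulated pitch values; by construction only wFB drives HR here.
toy <- data.frame(
  wFB = rnorm(n, sd = 8),
  wSL = rnorm(n, sd = 4),
  wCH = rnorm(n, sd = 3)
)
toy$HR <- 15 + 0.6 * toy$wFB + rnorm(n, sd = 5)

# Linear model of home runs on the pitch values.
pitch_fit  <- lm(HR ~ wFB + wSL + wCH, data = toy)
coef_table <- summary(pitch_fit)$coefficients

# Estimate and p-value for each term.
coef_table[, c("Estimate", "Pr(>|t|)")]
```

On the real data the formula would include wCB, wCT and wSF as well, and the two rows with missing wSF would be dropped by `lm()` automatically.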
ggplot(HomeRuns,aes(x=HR,y=Doubles)) + geom_point() + geom_smooth(method = 'lm')
My dad used to help me study and draft my fantasy teams, and we still do a league together with some Texas Rangers beat writers (his best friend moved to Texas). One of the first bits of advice he gave me for drafting in the late rounds was to look at doubles when hunting for power hitters who are about to break out. If a guy hits a lot of doubles, he likely just missed turning those hits into home runs, since doubles are usually hit deep. It’s nowhere near perfect, especially now that Statcast tells you how hard and far every ball is hit, but sure enough there’s some pretty high correlation here.
ggplot(HomeRuns,aes(x=HR,y=BB_K)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=BBPct)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=KPct)) + geom_point() + geom_smooth(method = 'lm')
Strikeouts and walks are two of the three true outcomes in baseball, along with home runs. These are the outcomes that don’t involve the fielders, so they depend mostly on the hitter and the pitcher. In today’s game a lot of players strike out often, and that is often linked to lower averages but more home runs. We can see that walks have a bit of a correlation with home runs, but strikeouts don’t have much of one at all. Strikeouts do seem to have a small positive trend line, so interestingly enough, strikeout-prone hitters tend to be better power hitters on average. Taking a walk rather than striking out is seen as a sign of a good eye, but we can see this barely translates into hitting for power.
ggplot(HomeRuns,aes(x=HR,y=OBP)) + geom_point() + geom_smooth(method = 'lm')
Home runs count toward OBP as times on base, so the two are related by construction. It’s not too much of a problem, though, since a homer weighs the same as any other time on base. Being a good hitter does have somewhat of a relationship with hitting homers.
ggplot(HomeRuns,aes(x=HR,y=BABIP)) + geom_point() + geom_smooth(method = 'lm')
There’s really nothing to see here, despite the fact that you’d think having a good average on balls in play would lead to having more homers.
ggplot(HomeRuns,aes(x=HR,y=OSwingPct)) + geom_point() + geom_smooth(method = 'lm')
These are pitches outside of the strike zone that are swung at. Free swingers show no real relationship with home runs, so it doesn’t seem to hurt them to swing often, even if they swing at a lot of bad pitches.
ggplot(HomeRuns,aes(x=HR,y=ZSwingPct)) + geom_point() + geom_smooth(method = 'lm')
Again, swinging often doesn’t hurt, and it helps slightly more if you swing at pitches in the zone.
ggplot(HomeRuns,aes(x=HR,y=SwingPct)) + geom_point() + geom_smooth(method = 'lm')
This is a case of two groups cancelling each other out. There are really selective power hitters like Joey Votto and guys who swing at everything like Khris Davis, who both have power, so how much you swing has essentially no relationship with home runs (the correlation in the matrix above is about 0.01).
ggplot(HomeRuns,aes(x=HR,y=OContactPct)) + geom_point(color = "red") + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=ZContactPct)) + geom_point(color= "dark red") + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=ContactPct)) + geom_point() + geom_smooth(method = 'lm')
These graphs all show that being a good contact hitter might actually hurt you if you’re trying exclusively to hit home runs, because making contact can just mean weak contact for outs. All three contact rates have a small negative correlation with HR (roughly -0.08 to -0.16 in the matrix above). There seems to be a beauty to swinging hard just in case you hit the ball in today’s game. The type of contact you make is more important than whether you make it or where you make it.
ggplot(HomeRuns,aes(x=HR,y=ZonePct)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=FStrikePct)) + geom_point() + geom_smooth(method = 'lm')
This suggests that good power hitters really do see fewer first-pitch strikes and fewer strikes altogether: the correlation with HR is about -0.47 for ZonePct and -0.22 for FStrikePct. That is something interesting to keep in mind when I make my model, because lower percentages in these stats could be a sign of a power hitter.
ggplot(HomeRuns,aes(x=HR,y=SwStrPct)) + geom_point() + geom_smooth(method = 'lm')
Another graphical example of how swinging can lead to more homers even if you miss the ball often, although the relationship is weak (a correlation of about 0.14).
ggplot(HomeRuns,aes(x=HR,y=PullPct)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=CentPct)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=OppoPct)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=SoftPct)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=MedPct)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=HardPct)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=GB_FB)) + geom_point() + geom_smooth(method = 'lm')
More fly balls compared to grounders = a better chance at a home run. Just look at the trend. GB_FB is ground balls divided by fly balls, so a negative trend is actually good for home runs in this case.
ggplot(HomeRuns,aes(x=HR,y=GB_FB)) + geom_point(color =ifelse(((abs(HomeRuns$GB_FB)>1.5)),"pink", "black")) + geom_smooth(method = 'lm')
Just wanted to make this graph more appealing to the eye and show the importance of keeping the ball in the air even more. Players with a GB/FB ratio above 1.5 are colored pink.
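The coloring above is done by passing a vector of color names straight to `geom_point()`. A slightly more idiomatic sketch, assuming the same 1.5 cutoff, is to store the flag as a column and map it to color inside `aes()`, which also gives a legend for free. The `toy` rows and the `GroundBallHeavy` column name are made up for illustration.

```r
# Invented example rows standing in for the HomeRuns data frame.
toy <- data.frame(
  HR    = c(5, 12, 30, 38, 8),
  GB_FB = c(2.1, 1.6, 0.8, 0.6, 1.4)
)

# Flag ground-ball-heavy hitters; 1.5 is the same cutoff used above.
toy$GroundBallHeavy <- toy$GB_FB > 1.5

# Only draw the plot if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = HR, y = GB_FB)) +
    geom_point(aes(color = GroundBallHeavy)) +
    scale_color_manual(values = c("TRUE" = "pink", "FALSE" = "black")) +
    geom_smooth(method = "lm", se = FALSE)
  print(p)
}
```

The `abs()` in the original call is not needed, since a ground ball to fly ball ratio cannot be negative.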
ggplot(HomeRuns,aes(x=HR,y=LDPct)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=GBPct)) + geom_point() + geom_smooth(method = 'lm')
ggplot(HomeRuns,aes(x=HR,y=FBPct)) + geom_point() + geom_smooth(method = 'lm')
Groundballs definitely do not equate to home runs, and I think GBPct will be the key negative variable in any model where I allow negative coefficients. I also have a feeling that FBPct will end up in my best model, since guys who hit a lot of flyballs hit more homers, even though the trend line only slopes upward a bit here. I’m actually surprised that line drives have that much of a negative effect, because they usually indicate a very good hitter, but it looks like I’m going nowhere with that.
ggplot(HomeRuns,aes(x=HR,y=HR_FB)) + geom_point() + geom_smooth(method = 'lm')
I’m probably not going to use this in my model since it directly involves home runs. There is a league average for this number, though, and chances are that if a player is above the trend line, their homers will decline the next season. Then again, they might consistently outperform the average due to an uppercut swing, playing in a park that favors hitters, and so on.
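The HR/FB idea can be turned into a quick check: multiply a player’s fly balls by a league-average HR/FB rate to get an expected home run total, then compare it with what he actually hit. This is a back-of-the-envelope sketch. The 0.13 league rate, the player names and the fly ball counts are all placeholder assumptions, not real figures.

```r
# Assumed league-average HR/FB rate, for illustration only.
league_hr_fb <- 0.13

# Hypothetical players with fly ball counts and actual home runs.
players <- data.frame(
  Name     = c("Player A", "Player B"),
  FlyBalls = c(150, 120),
  HR       = c(30, 12)
)

players$HR_FB      <- players$HR / players$FlyBalls        # player's own rate
players$ExpectedHR <- players$FlyBalls * league_hr_fb      # HR at the league rate
players$Surplus    <- players$HR - players$ExpectedHR      # positive = above league rate
players
```

A large positive surplus would flag someone whose total might come down next season, unless his swing or home park explains it.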
Overall, I have a pretty good feel for my variables and what I’m working with now, and definitely a better idea of what models I want to build. Even if I don’t get a really great model, I think I can learn some important things about these numbers and how they can project home runs. The goal, though, is to build a great model to use for fantasy baseball predictions (and then to get an A on this project and get signed by an MLB team to do statistical analysis).
save(HomeRuns, file = "HomeRunsNewdf.rdata")